Attach libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
library(here)
## here() starts at C:/Users/jethu/Documents/R Studio - Stat 383/Lecture Code/DS241Portfolioz
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
Assign the flights data frame to variable df1
df1 <- flights
View the flights data frame
df1
Task 1: Flights in September from Miami
df2 <- df1 |>
filter((month == 9) & (origin == "MIA"))
df2
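There are no flights from Miami in this dataset, because nycflights13 only records flights departing from the New York City airports (EWR, JFK, and LGA).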
Task 2: Flights in September going to Miami
df3 <- df1 |>
filter(month == 9 & dest == "MIA")
df3
There were 912 flights going to Miami from a New York City airport in September 2013.
Task 4: Flights in January going to Miami
df4 <- df1 |>
filter(month == 1 & dest == "MIA")
df4
There were 981 flights going to Miami in the month of January.
Task 5: Flights in summer going to Chicago
There are two major airports in Chicago included in this data set as destination airports. Below are the airport names and their IATA codes. This information was taken from https://www.cleartrip.com/tourism/airports/chicago-airport.html
O'Hare International Airport: ORD
Chicago Midway International Airport: MDW
Since “summer” was not specifically defined in the task, we have to pick a definition. The months most widely associated with summer are June, July, and August; the filter below instead uses the astronomical season, from June 21 to September 22, 2013.
df5 <- df1 |>
  filter(time_hour >= "2013-06-21", time_hour <= "2013-09-22") |>
  filter(dest %in% c("ORD", "MDW"))
There were 5,770 flights to either O'Hare or Midway in this summer window.
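Because the task leaves “summer” undefined, the choice of window changes which rows are kept. As a rough sketch (base R only, using a plain sequence of 2013 calendar dates as a stand-in for the flights data; all object names are illustrative), the calendar-month definition and the date window used in the filter above can be compared by counting the days each one covers:

```r
# Every calendar day in 2013 (illustrative stand-in for flight dates).
days_2013 <- seq(as.Date("2013-01-01"), as.Date("2013-12-31"), by = "day")

# Definition A: the calendar months June, July, and August.
month_num <- as.integer(format(days_2013, "%m"))
summer_months <- days_2013[month_num %in% 6:8]

# Definition B: the date window used in the filter above.
summer_window <- days_2013[days_2013 >= as.Date("2013-06-21") &
                             days_2013 <= as.Date("2013-09-22")]

length(summer_months)  # 92 days
length(summer_window)  # 94 days
```

The two windows overlap but are not the same set of days, so the flight count depends on which one is used. With a date-time column such as time_hour, a bare date string is typically interpreted as midnight, so an upper bound of "2013-09-22" would likely exclude most flights on that day.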
Task 6: Delays associated with flight number 86
flight_nums = df1 |> filter((month == 9) & (dest == "MIA"))
flight_nums = unique(flight_nums$flight)
df6 = df1 |>
filter(flight == min(flight_nums), dest == "MIA")
df6 |>
ggplot(aes(x=dep_delay, y=arr_delay)) + geom_point()
Notes to self, questions to explore:
Does the departure delay change across the time of day (do flights later in the day have more delays)?
Is the flight time pattern affected by the time of year?
Is the departure delay affected by the time of year?
Flights from NYC to Miami
df1 |>
filter(dest == "MIA") |>
count(origin,sort=TRUE)
Let’s Check!
I want to examine whether the flight time is impacted by a delayed departure.
I want to compare the actual flight time to the planned flight time, so we create a new variable:
flt_delta=arr_delay-dep_delay
A flight that departs on time but arrives 10 minutes late has a “delta” of 10 minutes.
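To make the definition concrete, here is a minimal sketch on a made-up table. The delay values are hypothetical and are not taken from the flights data; only the column names match it.

```r
library(dplyr)

# Hypothetical delays in minutes (not real flights).
toy <- tibble(
  dep_delay = c(0, 30, 30, -5),
  arr_delay = c(10, 30, 20, -5)
)

toy_delta <- toy |>
  mutate(flt_delta = arr_delay - dep_delay)

toy_delta$flt_delta
# 10   0 -10   0
```

A positive flt_delta means the flight lost time in the air relative to the schedule, zero means the departure delay carried through unchanged, and a negative value means the flight made up time.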
Flights to Miami departing strictly from La Guardia (LGA)
df7=df1 |>
filter(dest == "MIA", origin=="LGA") |>
mutate(flt_delta=arr_delay-dep_delay)
Scatter plot of flt_delta against departure delay for the La Guardia flights
df7 |>
ggplot(aes(x=dep_delay, y=flt_delta)) + geom_point(alpha=.1)
## Warning: Removed 79 rows containing missing values (`geom_point()`).
The same scatter plot with a horizontal line drawn at the mean flt_delta (computed with na.rm=TRUE)
df7 |>
ggplot(aes(x=dep_delay, y=flt_delta)) + geom_point(alpha=.1)+geom_hline(aes(yintercept=mean(flt_delta,na.rm=TRUE)))
## Warning: Removed 79 rows containing missing values (`geom_point()`).
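The warning shows that some rows have missing values, which is why the mean is computed with na.rm=TRUE. A small sketch with hypothetical numbers (base R only) shows what happens without it:

```r
# Hypothetical flt_delta values with one missing entry.
flt_delta_toy <- c(10, -4, NA, 6)

# Without na.rm the missing value propagates and the result is NA.
mean_with_na <- mean(flt_delta_toy)

# With na.rm = TRUE the mean is taken over 10, -4, and 6.
mean_without_na <- mean(flt_delta_toy, na.rm = TRUE)

mean_with_na     # NA
mean_without_na  # 4
```

If the mean were NA, geom_hline() would have no valid y-intercept to draw.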
Is departure delay affected by time of year? Let’s check below:
df7 |>
ggplot(aes(x=time_hour, y=dep_delay)) + geom_point(alpha=.1)+stat_smooth()+ylim(-25,120)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 168 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 168 rows containing missing values (`geom_point()`).
Why are delays bigger in December than in January? It is probably not due to weather, since January is just as cold.
Departure delay for the La Guardia flights by scheduled time of day (x-axis: hour + minute/60, y-axis: departure delay)
df7 |>
ggplot(aes(x=hour + minute/60, y=dep_delay)) + geom_point(alpha=.1)+stat_smooth()+ylim(-25,120)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 168 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 168 rows containing missing values (`geom_point()`).
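The x aesthetic above combines the scheduled hour and minute into a single decimal hour. A quick sketch with hypothetical times (base R only; the variable names are illustrative) shows the conversion:

```r
# Hypothetical scheduled departure times: 06:00, 13:45, and 21:30.
sched_hour <- c(6, 13, 21)
sched_minute <- c(0, 45, 30)

# Minutes become a fraction of an hour.
decimal_hour <- sched_hour + sched_minute / 60

decimal_hour
# 6.00 13.75 21.50
```

Plotting against this single continuous value keeps flights within the same hour spread out along the axis instead of stacked on one tick.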
Observation: departure delay increases over the course of the day.
The same plot for the La Guardia flights, colored by day of the week
df7 |>
mutate(day_of_week=weekdays(time_hour))|>
ggplot(aes(x=hour + minute/60, y=dep_delay, color = day_of_week)) + geom_point(alpha=.1)+stat_smooth()+ylim(-25,120)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 168 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 168 rows containing missing values (`geom_point()`).
The same plot for the La Guardia flights, split into one panel per day of the week with the facet_wrap function
df7 |>
mutate(day_of_week=weekdays(time_hour))|>
ggplot(aes(x=hour + minute/60, y=dep_delay, color = day_of_week)) + geom_point(alpha=.1)+stat_smooth()+ylim(-20,40) + facet_wrap(~day_of_week)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 478 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 478 rows containing missing values (`geom_point()`).
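Because weekdays() returns plain character strings, ggplot2 orders the panels and legend alphabetically (Friday first). One possible way to get calendar order, sketched here on three hypothetical timestamps, is to turn the day into a factor with explicit levels. The lookup through as.POSIXlt()$wday is a substitute for weekdays() chosen so the sketch does not depend on the session locale.

```r
# Three hypothetical timestamps: a Tuesday, a Saturday, and a Monday in January 2013.
stamps <- as.POSIXct(
  c("2013-01-01 08:00", "2013-01-05 08:00", "2013-01-07 08:00"),
  tz = "America/New_York"
)

# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday, regardless of locale.
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
day_chr <- day_names[as.POSIXlt(stamps)$wday + 1]

# Factor levels control the order of facets and legend keys in ggplot2.
day_of_week <- factor(day_chr, levels = c(day_names[2:7], day_names[1]))

as.character(day_of_week)  # "Tuesday" "Saturday" "Monday"
levels(day_of_week)        # Monday first, Sunday last
```

The same factor() call could be placed inside the mutate() step before plotting.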
Download the zip file “DL_SelectFields.zip” from https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FIM&QO_fu146_anzr=Nv4%20Pn44vr45 with the filters set to Filter Geography: All, Filter Year: 2022, Filter Period: All Months.
After you’re done, save the file into the folder data_raw, then use the lines of code below to read its contents.
WARNING: Make sure your file path matches the one below to avoid errors.
thisfile=here("data_raw","DL_SelectFields.zip")
df2022=read_csv(thisfile) %>% clean_names()
## Rows: 406956 Columns: 45
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): UNIQUE_CARRIER, UNIQUE_CARRIER_NAME, UNIQUE_CARRIER_ENTITY, REGION...
## dbl (27): DEPARTURES_SCHEDULED, DEPARTURES_PERFORMED, PAYLOAD, SEATS, PASSEN...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
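As a small illustration of what clean_names() does to the upper-case column names reported above, here is a sketch on a two-column table. The column names come from the column specification; the values are made up.

```r
library(dplyr)
library(janitor)

# Hypothetical rows; only the column names mirror the downloaded file.
raw <- tibble(
  UNIQUE_CARRIER = c("DL", "9E"),
  DEPARTURES_PERFORMED = c(31, 28)
)

cleaned <- raw |> clean_names()

names(cleaned)
# "unique_carrier" "departures_performed"
```

The lower-case snake_case names are what the later filter() and ggplot() calls refer to.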
clean_names() cleans up the format of the column names.
Subsetting to data of interest
Let’s focus on flights from La Guardia (LGA) Airport and eliminate cargo flights by requiring at least one passenger. We call the resulting data frame “df9”.
df9=df2022 |> filter(passengers > 0, origin == "LGA")
Let’s display the data in a bar chart, because the best way to communicate data is to visualize it!
df9 |>
ggplot(aes(month))+geom_bar()
By default, geom_bar() counts the number of rows, and we have asked it to show that count by month.
Take a look at the dataset, and discover why counting rows is not going to give us a count of flights.
Displaying the data we want. P.S. We weight each row by its number of departures performed.
df9 |>
ggplot(aes(month))+geom_bar(aes(weight=departures_performed))
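The difference between counting rows and weighting them can be checked on a toy table. This is a sketch under the assumption that each row is a summary covering many departures, as the departures_performed column suggests; the values and carrier labels are hypothetical.

```r
library(dplyr)

# Hypothetical summary rows: each row aggregates many departures.
toy <- tibble(
  month = c(1, 1, 2),
  carrier_name = c("A", "B", "A"),
  departures_performed = c(30, 12, 25)
)

# Row counts per month: what geom_bar() shows by default.
rows_per_month <- toy |> count(month)

# Sum of departures per month: what the weight aesthetic shows.
departures_per_month <- toy |> count(month, wt = departures_performed)

rows_per_month$n        # 2 1
departures_per_month$n  # 42 25
```

The row count only says how many summary rows exist for a month, while the weighted count adds up the departures they represent.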
Make some observations about this plot
A new visualization
df9 |>
ggplot(aes(month))+geom_bar(aes(weight=passengers))
Observation: passenger numbers were most probably affected by the COVID-19 pandemic.
Here’s a more colorful plot below:
df9 |>
ggplot(aes(month,fill=carrier_name))+geom_bar(aes(weight=departures_performed))
Arrivals and departures at La Guardia
df10 = df2022 |> filter(passengers > 0, origin == "LGA" | dest =="LGA")
df10 |> ggplot(aes(month)) + geom_bar(aes(weight=passengers))
We select month, passengers, seats, carrier name, destination, and
origin of the flight for “df11”.
df11 = df10 |>
  select(month, passengers, seats, carrier_name, dest, origin)
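The data also has id columns. For “df12” we keep the first five columns, month, and every column whose name contains “id”.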
df12 = df10 |> select(1:5, month, contains("id"))
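One thing to watch with contains("id") is that it matches the text anywhere in a column name, not only an "_id" suffix. A minimal sketch on a hypothetical table (the paid_seats column and all values are invented for illustration) shows the difference from ends_with():

```r
library(dplyr)

# Hypothetical one-row table; paid_seats is an invented column name.
toy <- tibble(
  carrier = "DL",
  month = 1,
  airline_id = 19790,
  origin_airport_id = 12953,
  paid_seats = 100
)

# contains("id") also picks up paid_seats, because "paid" contains "id".
by_contains <- toy |> select(1, month, contains("id"))

# ends_with("_id") keeps only the columns with an "_id" suffix.
by_suffix <- toy |> select(1, month, ends_with("_id"))

names(by_contains)
# "carrier" "month" "airline_id" "origin_airport_id" "paid_seats"
names(by_suffix)
# "carrier" "month" "airline_id" "origin_airport_id"
```

Checking names() on the result is a quick way to confirm that only the intended id columns were selected.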
We compute the percentage of seats filled for each row and display its distribution for the various airlines in faceted histograms
df13 = df11 |> mutate(percent_loading = passengers/seats*100)
df13 |> ggplot(aes(percent_loading)) + geom_histogram()+facet_wrap(~carrier_name, scales = "free_y")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Observation: Delta and Endeavor Air Inc. seem to be the most used airlines during this time period (2022).
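The percent_loading calculation divides by seats, so a row with zero seats would produce an infinite value. The output above does not show that this happens in the 2022 data, so the following is only a precautionary sketch on hypothetical values:

```r
library(dplyr)

# Hypothetical rows; the second one has no seats recorded.
toy <- tibble(
  carrier_name = c("A", "A", "B"),
  passengers = c(80, 5, 120),
  seats = c(100, 0, 150)
)

# Drop rows without seats before dividing, so no Inf values reach the plot.
loading <- toy |>
  filter(seats > 0) |>
  mutate(percent_loading = passengers / seats * 100)

loading$percent_loading
# 80 80
```

Adding the same filter(seats > 0) step before the mutate() that builds df13 would be one way to apply this guard.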